
    Analytical cost metrics: days of future past

    Summer 2019. Includes bibliographical references.
    Future exascale high-performance computing (HPC) systems are expected to be increasingly heterogeneous, consisting of several multi-core CPUs and a large number of accelerators: special-purpose hardware that increases the computing power of the system in a very energy-efficient way. Specialized, energy-efficient accelerators are also an important component in many diverse systems beyond HPC: gaming machines, general-purpose workstations, tablets, phones, and other media devices. With Moore's law driving the evolution of hardware platforms toward exascale, the dominant performance metric (time efficiency) has expanded to also incorporate power/energy efficiency. This work builds analytical cost models for metrics such as time, energy, memory accesses, and silicon area. These models are used to predict application performance, for performance tuning, and for chip design. The idea is to work with domain-specific accelerators, where analytical cost models can be applied accurately for performance optimization. The performance optimization problems are formulated as mathematical optimization problems. This work explores the analytical cost modeling and mathematical optimization approach in several ways. For stencil applications and GPU architectures, analytical cost models are developed for execution time as well as energy. The models are used for performance tuning on existing architectures, and are coupled with silicon-area models of GPU architectures to generate highly efficient architecture configurations. For matrix chain products, analytical closed-form solutions for off-chip data movement are built and used to minimize the total data-movement cost of a minimum-operation-count tree.
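    The minimum-operation-count tree mentioned above is found by the textbook matrix-chain dynamic program. The sketch below is that classic algorithm only, not the thesis's data-movement model, which is layered on top of such a tree:

    ```python
    def matrix_chain_min_ops(dims):
        """Minimum scalar multiplications to evaluate a matrix chain.
        Matrix i has shape dims[i] x dims[i+1]."""
        n = len(dims) - 1                       # number of matrices
        cost = [[0] * n for _ in range(n)]      # cost[i][j]: best cost for chain i..j
        for span in range(1, n):                # chain length minus one
            for i in range(n - span):
                j = i + span
                cost[i][j] = min(
                    cost[i][k] + cost[k + 1][j]
                    + dims[i] * dims[k + 1] * dims[j + 1]  # cost of the final multiply
                    for k in range(i, j)
                )
        return cost[0][n - 1]
    ```

    For example, for shapes (10x30)(30x5)(5x60) the program picks (AB)C, needing 4500 multiplications rather than 27000 for A(BC).
    
    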

    BB-ML: Basic Block Performance Prediction using Machine Learning Techniques

    Recent years have seen the adoption of Machine Learning (ML) techniques to predict the performance of large-scale applications, mostly at a coarse level. In contrast, we propose to use ML techniques for performance prediction at a much finer granularity, namely at the level of Basic Blocks (BBs): single-entry, single-exit code blocks that compilers use to break a large code into manageable pieces for analysis. We extrapolate the basic block execution counts of GPU applications and use them to predict performance for large input sizes from the counts observed at smaller input sizes. We train a Poisson Neural Network (PNN) model using random input values as well as the lowest input values of the application to learn the relationship between inputs and basic block counts. Experimental results show that the model can accurately predict the basic block execution counts of 16 GPU benchmarks. We achieve an accuracy of 93.5% in extrapolating the basic block counts for large input sets when trained on smaller input sets, and an accuracy of 97.7% in predicting basic block counts on random instances. In a case study, we apply the ML model to CUDA GPU benchmarks for performance prediction across a spectrum of applications. We use a variety of metrics for evaluation, including global memory requests and the active cycles of tensor cores, ALU, and FMA units. Results demonstrate the model's capability of predicting the performance of large datasets with an average error rate of 0.85% and 0.17% for global and shared memory requests, respectively.
    Additionally, to address the utilization of the main functional units in Ampere-architecture GPUs, we calculate the active cycles for tensor cores, ALU, FMA, and FP64 units and achieve average errors of 2.3% and 10.66% for the ALU and FMA units, while the maximum observed error across all tested applications and units reaches 18.5%.
    Comment: Accepted at the 29th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2023).
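    The extrapolation idea rests on basic-block counts growing as simple functions of input size. As a minimal sketch, and not the paper's PNN, the snippet below fits an ordinary least-squares line to fabricated counts of a hypothetical block that executes once per element plus two loop-setup executions, then extrapolates to a larger input:

    ```python
    def fit_line(xs, ys):
        """Ordinary least-squares fit of y = a*x + b."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
        return a, my - a * mx

    # Fabricated profile: count = input size + 2 (illustration only).
    sizes  = [64, 128, 256, 512]
    counts = [66, 130, 258, 514]
    a, b = fit_line(sizes, counts)
    predicted = a * 4096 + b   # extrapolated count for a much larger input
    ```

    Real basic-block counts can be polynomial or data-dependent in the inputs, which is why the paper trains a learned model instead of a fixed functional form.
    
    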

    AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs

    Stencil computation is one of the most widely used compute patterns in high-performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, correctly implementing those optimizations to achieve high performance, while accounting for the complexity of the GPU architecture and memory hierarchy, is difficult. We propose AN5D, an automated stencil framework capable of automatically transforming and optimizing stencil patterns in a given C source code and generating the corresponding CUDA code. Parameter tuning in our framework is guided by our performance model. Our novel optimization strategy reduces shared memory and register pressure compared to existing implementations, allowing performance to scale up to a temporal blocking degree of 10. We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU.
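    Temporal blocking trades redundant halo computation for fewer off-chip reads: each tile is loaded once with a halo and advanced several sweeps locally before writing back. The sketch below shows the data movement for a 1-D 3-point stencil in plain Python; it is an illustration of the general technique, not AN5D's generated CUDA:

    ```python
    def sweep(u):
        """One reference sweep of a 3-point averaging stencil; endpoints fixed."""
        return [u[0]] + [(u[i-1] + u[i] + u[i+1]) / 3.0
                         for i in range(1, len(u) - 1)] + [u[-1]]

    def blocked(u, steps, tile=8, degree=2):
        """Degree-`degree` temporal blocking: each tile is read once with a halo
        of `degree` cells, advanced `degree` sweeps in local scratch, and only
        the still-valid core is written back (halo cells go stale each sweep)."""
        n = len(u)
        assert steps % degree == 0
        for _ in range(steps // degree):
            out = u[:]
            for lo in range(1, n - 1, tile):
                hi = min(lo + tile, n - 1)                 # tile updates u[lo:hi]
                s, e = max(lo - degree, 0), min(hi + degree, n)
                local = u[s:e]                             # single read incl. halo
                for _ in range(degree):
                    local = sweep(local)                   # halo becomes stale...
                out[lo:hi] = local[lo - s:hi - s]          # ...so keep only the core
            u = out
        return u
    ```

    On a GPU the `local` array lives in shared memory or registers, so the `degree - 1` intermediate sweeps never touch DRAM; the framework's contribution is keeping the on-chip pressure low enough for this to scale to degree 10.
    
    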

    Spin-Hall effect of light at a tilted polarizer

    We describe the spin-Hall effect of light (as well as the angular Goos-Hänchen effect) at a tilted linear-dichroic plate, such as a usual linear polarizer. Although the spin-Hall effect at a tilted polarizer was previously associated with the geometric spin-Hall effect of light (which was contrasted with the regular spin-Hall effect) [Phys. Rev. Lett. 112, 113902 (2014)], we show that the effect is actually an example of the regular spin-Hall effect that occurs at tilted anisotropic plates [Optica 3, 1039 (2016)]. Moreover, our approach reveals the angular spin-Hall shift, which is absent in the "geometric" approach. We verify our theory experimentally using the method of quantum weak measurements.
    Funding: Air Force Office of Scientific Research (FA9550-14-1-0040); Army Research Office (W911NF-18-1-0358); Core Research for Evolutional Science and Technology (JPMJCR1676); Japan Science and Technology Agency (QLEAP); Japan Society for the Promotion of Science (VS.059.18N); John Templeton Foundation; Science and Engineering Research Board (TAR/2018/000552); Australian Research Council; Science and Engineering Research Board (SERB), India; Asian Office of Aerospace Research and Development (AOARD) (FA2386-18-1-4045)

    Energy Modeling and Optimization for Tiled Nested-Loop Codes

    We develop a methodology for modeling the energy efficiency of tiled nested-loop codes running on a graphics processing unit (GPU) and use it for energy-efficiency optimization. We assume that a highly optimized and parametrized version of a tiled nested-loop code, either written by an expert programmer or automatically produced by a polyhedral compilation tool, is given to us as an input. We then model the energy consumption as an analytical function of a set of parameters characterizing the software and the GPU hardware. Most previous attempts at GPU energy modeling were based on low-level machine models that were then used to model whole programs through simulations, or were analytical models that required low-level details. In contrast, our approach develops analytical models based on (i) machine and architecture parameters, (ii) program size parameters as found in the polyhedral model, and (iii) tiling parameters, such as those chosen by auto- or manual tuners. Our model therefore allows efficient optimization of energy efficiency with respect to a set of parameters of interest. We illustrate the framework on three nested-loop codes: Smith-Waterman, and one-dimensional and two-dimensional Jacobi stencils, and analyze the accuracy of the resulting models. We also show that the models can be used for optimal tile-size selection for energy efficiency. With an optimal choice of model parameters, the RMS error is less than 4%. Two factors allow us to attain this high accuracy. The first is domain specificity: we focus only on tileable nested-loop codes. The second is that we decouple the energy model from a model of the execution time, a known hard problem.
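    The shape of such an optimization can be shown in a few lines. The snippet below is a hypothetical instance, not the paper's model: the constants are invented stand-ins, and the energy function is a toy model for a 2-D Jacobi tile of size T x T, minimized subject to a shared-memory constraint:

    ```python
    # Illustrative constants -- invented stand-ins, not values from the paper.
    E_DRAM = 100e-12     # assumed energy per off-chip element access (J)
    E_OP   = 1e-12       # assumed energy per arithmetic operation (J)
    SMEM   = 48 * 1024   # shared-memory budget per thread block (bytes)

    def energy_per_cell(T, halo=1, ops=5, elem_bytes=8):
        """Modeled energy to update one cell with a T x T tile: compute energy
        plus DRAM traffic (tile + halo, read once) amortized over T*T cells."""
        dram_loads = (T + 2 * halo) ** 2
        return E_OP * ops + E_DRAM * dram_loads / (T * T)

    def best_tile(halo=1, elem_bytes=8):
        """Pick the feasible tile size minimizing the analytical energy model."""
        feasible = [T for T in range(1, 512)
                    if (T + 2 * halo) ** 2 * elem_bytes <= SMEM]
        return min(feasible, key=energy_per_cell)
    ```

    In this toy model larger tiles always amortize the halo better, so the shared-memory constraint binds; a model with more parameters (occupancy, register pressure) can move the optimum into the interior of the feasible set.
    
    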

    Spin-Hall effect and circular birefringence of a uniaxial crystal plate

    We demonstrate theoretically and experimentally the fine lateral circular birefringence of uniaxial crystal plates, an example of the spin-Hall effect of light. We report experimental observations of this effect using polarimetric and quantum-weak-measurement techniques.

    Transformations for Energy Efficient Accelerated Chain Matrix Multiplication (TEE-ACM 2)

    GPU matrix chain multiplication serves as a basis for a wide range of scientific domains, such as computer graphics, physics, and machine learning. While its time performance has been studied for years, there has been significantly less effort in optimizing its energy efficiency. GPU power consumption is heavily impacted by the number of data transfers performed: a data transfer from global memory needs a thousand times more energy than a double-precision arithmetic operation. Thus, minimizing data transfers is key to reducing energy consumption. We present an energy-efficient solution for matrix chain multiplication on GPUs that minimizes computation as well as off-chip data transfers. For this, optimizations at three different levels are provided. For a single matrix multiplication, we use a blocking strategy that achieves the minimum number of global memory loads for a given amount of shared memory. We extend our approach to three matrices to decrease the data transfers even further. Finally, we use a parenthesizing algorithm that minimizes the number of computations as well as memory transfers for a whole sequence of matrices.
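    The single-multiplication level can be illustrated with a rough load-count model. This is a back-of-the-envelope sketch, not the paper's exact formulation: for a tiled C = A*B with tile sizes (tm, tn, tk), each A-tile is re-read once per column block of C and each B-tile once per row block, and we search tile shapes whose A- and B-tiles fit an assumed shared-memory budget (in elements):

    ```python
    def global_loads(M, N, K, tm, tn):
        """Approximate DRAM element loads for tiled C[M,N] = A[M,K] * B[K,N]."""
        col_blocks = -(-N // tn)   # ceil(N / tn): times A is streamed in
        row_blocks = -(-M // tm)   # ceil(M / tm): times B is streamed in
        return M * K * col_blocks + K * N * row_blocks

    def best_tiles(M, N, K, smem_elems, tk=8):
        """Search tile shapes fitting the shared-memory budget that minimize
        the modeled global load count; returns (loads, tm, tn)."""
        best = None
        for tm in range(8, 129, 8):
            for tn in range(8, 129, 8):
                if tm * tk + tk * tn <= smem_elems:   # A-tile + B-tile must fit
                    cand = (global_loads(M, N, K, tm, tn), tm, tn)
                    if best is None or cand < best:
                        best = cand
        return best
    ```

    The model makes the abstract's point concrete: global loads fall as 1/tm + 1/tn, so the largest square tiles that fit shared memory minimize off-chip traffic.
    
    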